Pinvon's Blog

所见, 所闻, 所思, 所想

pgmpy 教程

创建贝叶斯网

在 pgmpy 中, 定义一个贝叶斯网的流程一般是先建立网络结构, 然后填入相关参数.

建立网络结构

from pgmpy.models import BayesianModel

cancer_model = BayesianModel([('Pollution', 'Cancer'), 
                              ('Smoker', 'Cancer'),
                              ('Cancer', 'Xray'),
                              ('Cancer', 'Dyspnoea')])

这个网络中有五个节点: Pollution, Cancer, Smoker, Xray, Dyspnoea.

('Pollution', 'Cancer'): 一条有向边, 从 Pollution 指向 Cancer, 表示环境污染有可能导致癌症.

('Smoker', 'Cancer'): 吸烟有可能导致癌症.

('Cancer', 'Xray'): 得癌症的人可能会去照X射线.

('Cancer', 'Dyspnoea'): 得癌症的人可能会呼吸困难.

填入参数

from pgmpy.factors.discrete import TabularCPD

cpd_poll = TabularCPD(variable='Pollution', variable_card=2,
                      values=[[0.9], [0.1]])
cpd_smoke = TabularCPD(variable='Smoker', variable_card=2,
                       values=[[0.3], [0.7]])
cpd_cancer = TabularCPD(variable='Cancer', variable_card=2,
                        values=[[0.03, 0.05, 0.001, 0.02],
                                [0.97, 0.95, 0.999, 0.98]],
                        evidence=['Smoker', 'Pollution'],
                        evidence_card=[2, 2])
cpd_xray = TabularCPD(variable='Xray', variable_card=2,
                      values=[[0.9, 0.2], [0.1, 0.8]],
                      evidence=['Cancer'], evidence_card=[2])
cpd_dysp = TabularCPD(variable='Dyspnoea', variable_card=2,
                      values=[[0.65, 0.3], [0.35, 0.7]],
                      evidence=['Cancer'], evidence_card=[2])

这些代码, 主要是建立一些概率表, 然后往表里面填入了一些参数.

Pollution: 有两种概率, 分别是 0.9 和 0.1.

Smoker: 有两种概率, 分别是 0.3 和 0.7. (意思是在一个人群里, 有 30% 的人吸烟, 有 70% 的人不吸烟)

Cancer: envidence 表示有 Smoker 和 Pollution 两个节点指向 Cancer 节点;

[0.03, 0.05, 0.001, 0.02] 的意思, 通过下表可以较容易看出来:

Pollution Pollution_0 Pollution_0 Pollution_1 Pollution_1
Smoker Smoker_0 Smoker_1 Smoker_0 Smoker_1
Cancer_0 0.03 0.05 0.001 0.02
Cancer_1 0.97 0.95 0.999 0.98

可以看出, 环境不好, 又抽烟的人, 有 0.98 的概率得癌症.

Xray: 通过 envidence 可以看出, 只有 Cancer 指向它; [0.9, 0.2] 表示, 未得癌症的人去照X光, 有 0.9 的概率正常, 而得了癌症的人去照X光, 有 0.2 的概率正常, [0.1, 0.8] 表示, 未得癌症的人去照X光, 有 0.1 的概率不正常, 而得癌症的人有 0.8 的概率不正常.

Dyspnoea: 未得癌症的人有 0.65 的可能性呼吸正常, 得癌症的人只有 0.3 的可能性呼吸正常; 未得癌症的人有 0.35 的可能性呼吸困难, 得癌症的人有 0.7 的可能性呼吸困难.

将数据加入网络

# Associating the parameters with the model structure.
cancer_model.add_cpds(cpd_poll, cpd_smoke, cpd_cancer, cpd_xray, cpd_dysp)

# Checking if the cpds are valid for the model.
cancer_model.check_model()

第一行代码表示将概率分布表加入网络中; 第二行代码验证模型数据的正确性(至于怎么验证的, 我没有看源码, 但是猜测是验证相关数据的和是否为1, 验证数据中是否有负数, 等等).

相关的检查

cancer_model.is_active_trail('Pollution', 'Smoker')  # False
cancer_model.is_active_trail('Pollution', 'Smoker', observed=['Cancer'])  # True
cancer_model.local_independencies('Xray')  # (Xray _|_ Dyspnoea, Smoker, Pollution | Cancer)
cancer_model.get_independencies()

第一行代码表示, Pollution 节点是否指向 Smoker 节点, 返回的是 False.

第二行代码表示, Cancer 节点的计算, 是否与 Pollution 和 Smoker 两个节点有关, 返回的是 True.

第三行代码的输出表示, Xray 与 Dyspnoea, Smoker, Pollution 不相关, 与 Cancer 相关.

第四行代码输出所有的依赖关系.

使用 Asia 网络进行推理

在构建了贝叶斯网之后, 我们使用贝叶斯网来进行推理. 推理算法分精确推理和近似推理. 精确推理有变量消元法和团树传播法; 近似推理算法是基于随机抽样的算法.

Asia 网络是早期贝叶斯网文献中给出的一个网络, 与肺结核, 肺癌, 支气管炎等有关, 和我们之前的那个网络很相似. 我们使用 Asia 网络来看一下如何进行推理.

创建源文件 asia.py

下载网络模型放到源文件目录下:

wget http://www.bnlearn.com/bnrepository/asia/asia.bif.gz
gzip -qd asia.bif.gz

编辑 asia.py:

导入 Asia 网络:

from pgmpy.readwrite import BIFReader
reader = BIFReader('asia.bif')
asia_model = reader.get_model()

输出节点信息:

print(asia_model.nodes())

# ['xray', 'bronc', 'asia', 'dysp', 'lung', 'either', 'smoke', 'tub']

输出依赖关系:

print(asia_model.edges())

结果如下:

[('bronc', 'dysp'),
 ('asia', 'tub'),
 ('lung', 'either'),
 ('either', 'dysp'),
 ('either', 'xray'),
 ('smoke', 'bronc'),
 ('smoke', 'lung'),
 ('tub', 'either')]

查看概率分布:

print(asia_model.get_cpds())

结果如下:

[<TabularCPD representing P(asia:2) at 0x7f92cf1e46d0>,
 <TabularCPD representing P(bronc:2 | smoke:2) at 0x7f92ea70f990>,
 <TabularCPD representing P(dysp:2 | bronc:2, either:2) at 0x7f92f80f4ed0>,
 <TabularCPD representing P(either:2 | lung:2, tub:2) at 0x7f92e8e71dd0>,
 <TabularCPD representing P(lung:2 | smoke:2) at 0x7f92cf1e4c50>,
 <TabularCPD representing P(smoke:2) at 0x7f92cf1e4ed0>,
 <TabularCPD representing P(tub:2 | asia:2) at 0x7f92cf1ec210>,
 <TabularCPD representing P(xray:2 | either:2) at 0x7f92cf1ec410>]

直观显示某节点的概率分布:

print(asia_model.get_cpds('xray').values)

结果如下:

[[0.98 0.05]
 [0.02 0.95]]

变量消除法

变量消除法是精确推断的一种方法.

from pgmpy.inference import VariableElimination
asia_infer = VariableElimination(asia_model)
q = asia_infer.query(variables=['bronc'], evidence={'smoke': 0})
print(q['bronc'])

结果如下:

+---------+--------------+
| bronc   |   phi(bronc) |
|---------+--------------|
| bronc_0 |       0.6000 |
| bronc_1 |       0.4000 |
+---------+--------------+

意思是, 在不吸烟的情况下, 得支气管炎的概率是 0.4, 未得支气管炎的概率是 0.6.

在在吸烟情况下, 得支气管炎的概率和未得支气管炎的概率可以这样查询:

q = asia_infer.query(variables=['bronc'], evidence={'smoke': 1})
print(q['bronc'])

结果如下:

+---------+--------------+
| bronc   |   phi(bronc) |
|---------+--------------|
| bronc_0 |       0.3000 |
| bronc_1 |       0.7000 |
+---------+--------------+

利用训练数据学习

场景: 投硬币, 训练数据中, 有 30% 的正面朝上, 有 70% 的反面朝上. 我们使用极大似然估计和狄利克雷分布下贝叶斯参数先验估计硬币的条件概率分布.

生成数据:

import numpy as np
import pandas as pd
raw_data = np.array([0] * 30 + [1] * 70)
data = pd.DataFrame(raw_data, columns=['coin'])

定义贝叶斯网:

from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
model = BayesianModel()
model.add_node('coin')

使用极大似然估计:

model.fit(data, estimator=MaximumLikelihoodEstimator)
print(model.get_cpds('coin'))

结果如下:

+---------+-----+
| coin(0) | 0.3 |
+---------+-----+
| coin(1) | 0.7 |
+---------+-----+

使用狄利克雷分布作为先验分布的贝叶斯推论:

model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts={'coin': [50, 50]})
print(model.get_cpds('coin'))

结果如下:

+---------+-----+
| coin(0) | 0.4 |
+---------+-----+
| coin(1) | 0.6 |
+---------+-----+

更复杂的例子(学生例子)

学生例子中包含5个随机变量, 如下所示:

变量 含义 取值
Difficulty 课程本身难度 0=easy, 1=difficult
Intelligence 学生聪明程度 0=stupid, 1=smart
Grade 学生课程成绩 1=A, 2=B, 3=C
SAT 学生高考成绩 0=low, 1=high
Letter 可否获得推荐信 0=未获得, 1=获得

生成数据:

import numpy as np
import pandas as pd
raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
data = pd.DataFrame(raw_data, columns=['D', 'I', 'G', 'L', 'S'])

定义模型:

from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
model = BayesianModel([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])
model.fit(data, estimator=MaximumLikelihoodEstimator)
for cpd in model.get_cpds():
    print("CPD of {variable}:".format(variable=cpd.variable))
    print(cpd)

结果如下:

CPD of D:
+------+-------+
| D(0) | 0.501 |
+------+-------+
| D(1) | 0.499 |
+------+-------+
CPD of G:
+------+------+----------------+----------------+----------------+
| D    | D(0) | D(0)           | D(1)           | D(1)           |
+------+------+----------------+----------------+----------------+
| I    | I(0) | I(1)           | I(0)           | I(1)           |
+------+------+----------------+----------------+----------------+
| G(0) | 0.48 | 0.509960159363 | 0.444915254237 | 0.551330798479 |
+------+------+----------------+----------------+----------------+
| G(1) | 0.52 | 0.490039840637 | 0.555084745763 | 0.448669201521 |
+------+------+----------------+----------------+----------------+
CPD of I:
+------+-------+
| I(0) | 0.486 |
+------+-------+
| I(1) | 0.514 |
+------+-------+
CPD of L:
+------+----------------+----------------+
| G    | G(0)           | G(1)           |
+------+----------------+----------------+
| L(0) | 0.489959839357 | 0.501992031873 |
+------+----------------+----------------+
| L(1) | 0.510040160643 | 0.498007968127 |
+------+----------------+----------------+
CPD of S:
+------+----------------+----------------+
| I    | I(0)           | I(1)           |
+------+----------------+----------------+
| S(0) | 0.512345679012 | 0.468871595331 |
+------+----------------+----------------+
| S(1) | 0.487654320988 | 0.531128404669 |
+------+----------------+----------------+

参数学习:

pseudo_counts = {'D': [300, 700], 'I': [500, 500], 'G': [800, 200], 'L': [500, 500], 'S': [400, 600]}
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts=pseudo_counts)
for cpd in model.get_cpds():
    print("CPD of {variable}:".format(variable=cpd.variable))
    print(cpd)

结果如下:

CPD of D:
+------+--------+
| D(0) | 0.4005 |
+------+--------+
| D(1) | 0.5995 |
+------+--------+
CPD of G:
+------+-------+----------------+----------------+----------------+
| D    | D(0)  | D(0)           | D(1)           | D(1)           |
+------+-------+----------------+----------------+----------------+
| I    | I(0)  | I(1)           | I(0)           | I(1)           |
+------+-------+----------------+----------------+----------------+
| G(0) | 0.736 | 0.741806554756 | 0.732200647249 | 0.748218527316 |
+------+-------+----------------+----------------+----------------+
| G(1) | 0.264 | 0.258193445244 | 0.267799352751 | 0.251781472684 |
+------+-------+----------------+----------------+----------------+
CPD of I:
+------+-------+
| I(0) | 0.493 |
+------+-------+
| I(1) | 0.507 |
+------+-------+
CPD of L:
+------+----------------+----------------+
| G    | G(0)           | G(1)           |
+------+----------------+----------------+
| L(0) | 0.496662216288 | 0.500665778961 |
+------+----------------+----------------+
| L(1) | 0.503337783712 | 0.499334221039 |
+------+----------------+----------------+
CPD of S:
+------+----------------+----------------+
| I    | I(0)           | I(1)           |
+------+----------------+----------------+
| S(0) | 0.436742934051 | 0.423381770145 |
+------+----------------+----------------+
| S(1) | 0.563257065949 | 0.576618229855 |
+------+----------------+----------------+

Comments

使用 Disqus 评论
comments powered by Disqus